GPUs as Compute Engines


10 years ago:

•  Graphics  done  in  software 
5  years  ago: 

•  Full  graphics  pipeline 
Today: 

•  40x  geometry,  13x  fill  vs.  5  yrs  ago 

•  Programmable! 


Programmable,  data  parallel 
processing  on  every  desktop 

Enormous  opportunity  to  change  the 
way  commodity  computing  is  done! 


The Programmable Rendering Pipeline

• Compute 3D geometry

• Geometry: transform geometry from 3D

• Composite: combine fragments into image

GPU 

NVIDIA GeForce 6800 3D Pipeline

[Figure: pipeline block diagram; surviving labels: Z-Cull, Triangle Setup, Shader Instruction Dispatch]

Courtesy Nick Triantos, NVIDIA


Recent GPU Performance Trends

[Figure: 32-bit FP multiplies per second, vertical axis 0 to 50, horizontal axis July 01 to Jan 04]

Courtesy Pat Hanrahan / David Luebke


Why Are GPUs Fast?

Characteristics  of  computation  permit  efficient 
hardware  implementations 

•  High  amount  of  parallelism  ... 

•  ...  exploited  by  graphics  hardware 

•  High  latency  tolerance  and  feed-forward  dataflow  ... 

•  ...  allow  very  deep  pipelines 

•  ...  allow  optimization  for  bandwidth  not  latency 

Simple  control 

•  Restrictive  programming  model 

Competition  between  vendors 

What  about  programmability?  Effect  on 
performance?  How  hard  to  program? 


Programming  a  GPU  for  GP  Programs 


Draw a screen-sized quad

Run  a  SIMD  program 
over  each  fragment 

“Gather”  is  permitted 
from  texture  memory 

Resulting  buffer  can 
be  treated  as  texture 
on  next  pass 
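
The four steps above can be imitated on the CPU to make the data flow concrete. This is only a sketch in plain Python, not a real graphics API: `run_pass`, `fetch`, `blur_fragment` and `multipass` are hypothetical names, and the four-neighbour average is an assumed example program. Each fragment may read ("gather") any texel but writes only its own output location, and the buffer produced by one pass becomes the texture for the next.

```python
# CPU-side sketch of the multi-pass GPGPU pattern; no real graphics API is used.

def run_pass(fragment_program, texture):
    """One 'render pass': evaluate the program once per output fragment."""
    height, width = len(texture), len(texture[0])
    return [[fragment_program(texture, x, y) for x in range(width)]
            for y in range(height)]


def fetch(texture, x, y):
    """Gather: read any texel, clamping coordinates to the texture edge."""
    height, width = len(texture), len(texture[0])
    x = min(max(x, 0), width - 1)
    y = min(max(y, 0), height - 1)
    return texture[y][x]


def blur_fragment(texture, x, y):
    """Hypothetical fragment program: average of the four neighbours."""
    return (fetch(texture, x - 1, y) + fetch(texture, x + 1, y) +
            fetch(texture, x, y - 1) + fetch(texture, x, y + 1)) / 4.0


def multipass(texture, passes):
    """Feed each pass's output buffer back in as the next pass's texture."""
    for _ in range(passes):
        texture = run_pass(blur_fragment, texture)
    return texture


grid = [[0.0, 0.0, 0.0],
        [0.0, 8.0, 0.0],
        [0.0, 0.0, 0.0]]
one_pass = multipass(grid, 1)
two_pass = multipass(grid, 2)
```

Because every call to `blur_fragment` depends only on the input texture, the loop in `run_pass` could run in any order or all at once, which is the property the hardware exploits.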


GPU  Programming  is  Hard 

Must  think  in  graphics  metaphors 

Requires  parallel  programming  (CPU-GPU, 
task,  data,  instruction) 

Restrictive  programming  models  and 
instruction  sets 

Primitive  tools 

Rapidly  changing  interfaces 


Challenge: Programming Systems

                             CPU                          GPU
Model                        Scalar                       Stream?
                             STL, GNU SL, MPI, ...
Languages                    C, Fortran, ...              GLSL, Cg, HLSL, ...
Compilers                    gcc, vendor-specific, ...    Vendor-specific
Performance Analysis Tools   gdb, vtune, Purify, ...      Shadesmith, NVPerfHUD
                             Lots                         None
                             -> applications              -> kernels


Brook:  General-Purpose  Streaming  Language 
Stream  programming  model 

•  Treats  GPU  as  streaming  coprocessor 

•  Streams  enforce  data  parallel  computing 

•  Kernels  encourage  arithmetic  intensity 

•  Streams  and  kernels  explicitly  specified 

C  with  stream  extensions 

Open-source:  www.sf.net/projects/brook/ 

Ian Buck et al., “Brook for GPUs: Stream Computing on Graphics Hardware”, Siggraph 2004
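
The stream model in the bullets above can be illustrated without Brook itself. The following is a sketch in plain Python, not Brook syntax: the `kernel` decorator and the `saxpy` function are hypothetical names chosen for illustration. A kernel is written as a function of single elements and is then applied across whole streams, so each output element depends only on the matching input elements.

```python
# Illustrative only: plain Python, not Brook syntax.

def kernel(func):
    """Turn a per-element function into an operation on whole streams."""
    def apply(*streams):
        lengths = {len(s) for s in streams}
        if len(lengths) != 1:
            raise ValueError("all input streams must have the same length")
        # Each output element depends only on the matching input elements,
        # so the calls are independent and could run in parallel.
        return [func(*elements) for elements in zip(*streams)]
    return apply


@kernel
def saxpy(x, y):
    """Hypothetical kernel: one multiply and one add per pair of elements."""
    return 2.0 * x + y


xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]
result = saxpy(xs, ys)
```

Declaring the streams and the kernel explicitly is what lets a compiler or runtime decide how to map the work onto the hardware; here the mapping is just a list comprehension.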


Challenge:  GPU-to-Host  Bandwidth 


GPUs lack bandwidth to the host, so we won't use it!

No one uses host bandwidth, so we won't optimize it!

•  PCI-E  optimizes  GPU-to-CPU  bandwidth 

•  16-lane  card:  8  GB/s 

•  Scalable  in  future 

•  Major  vendors  support  PCI-E  cards  now 

•  Multiple  GPUs  supported  per  CPU  -  opportunity! 

•  Cheap  and  upgradable 
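
To give the 8 GB/s figure some scale, the sketch below estimates how long a readback would take at that rate. It is a rough lower bound under stated assumptions: the slide's figure is read as 8 × 10^9 bytes per second, protocol overhead and latency are ignored, and the 1024 × 1024 buffer of four 32-bit floats per element is an assumed example payload. `transfer_seconds` is a hypothetical helper.

```python
# The slide's figure for a 16-lane card, taken here as 8 * 10**9 bytes/s.
PCIE_X16_BYTES_PER_SECOND = 8e9


def transfer_seconds(num_bytes, bytes_per_second=PCIE_X16_BYTES_PER_SECOND):
    """Lower bound on transfer time: payload size divided by peak bandwidth."""
    if bytes_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / bytes_per_second


# Assumed payload: 1024 x 1024 elements, four 32-bit floats (16 bytes) each.
buffer_bytes = 1024 * 1024 * 4 * 4
readback_ms = transfer_seconds(buffer_bytes) * 1e3
print(f"{buffer_bytes} bytes take at least {readback_ms:.2f} ms")
```

At this rate a 16 MiB buffer costs roughly 2 ms per transfer, so results that stay on the GPU between passes avoid a cost of that order on every pass.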


Challenge:  Mobile/embedded  market 


Why? 

• UI, messaging/screen savers, navigation, gaming (location based)

Typical  specs  (cell-phone  class): 

• 200-800k gates, ~100 MHz, ~100 mW

• 1-10M vtx/s, 100+M frags/s

What’s  important? 

•  Visual  quality 

•  Power-efficient  (ops/W) 

•  Avoid  memory  accesses,  unified  shaders  ... 
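
The ops/W goal can be made concrete with the typical cell-phone-class specs listed above. This is back-of-envelope arithmetic under a strong assumption: that the whole ~100 mW budget is spent producing fragments at 100 M frags/s. `nanojoules_per_fragment` is a hypothetical helper.

```python
def nanojoules_per_fragment(power_watts, fragments_per_second):
    """Energy available per fragment if all power goes to fragment work."""
    if fragments_per_second <= 0:
        raise ValueError("fragment rate must be positive")
    joules = power_watts / fragments_per_second
    return joules * 1e9


# Figures from the slide: about 100 mW and 100 M fragments per second.
budget = nanojoules_per_fragment(0.100, 100e6)
print(f"about {budget:.1f} nJ per fragment")
```

A budget on the order of a nanojoule per fragment leaves little room for off-chip memory traffic, which is consistent with the slide's emphasis on avoiding memory accesses.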




Challenge:  Power 

Desktop: 

•  Double-width  cards 

• Workstation power supplies; draw power from motherboard

Mobile: 

• Batteries improving 5-10% per year

•  Ops/W  most  important 

www.coolingzone.com


Current  GPGPU  Research 


Image processing [Johnson/Frank/Vaidya, LLNL]

Alternate graphics pipelines [Purcell, Carr, Coombe]

Visual  simulation  [Harris] 

Volume rendering [Kniss, Krüger]

Level  set  computation  [Lefohn,  Strzodka] 

Numerical methods [Bolz, Krüger, Strzodka]

Molecular  dynamics  [Buck] 

Databases  [Sun,  Govindaraju] 


...


Grand  Challenges 

Architecture:  Increase  features  and 
performance  without  sacrificing  core 
mission 

Interfaces:  Abstractions,  APIs,  programming 
models,  languages 

•  Many  approaches  needed 

• Goal: C programs compiling to dynamically-balanced CPU-GPU clusters

•  Academic  and  research  community 

Applications: Killer app needed!


For more information ...


GPGPU  home:  http://www.gpgpu.org/ 


•  Mark  Harris,  UNC/NVIDIA 

GPU  Gems  (Addison-Wesley) 

•  Vol  1:  2004;  Vol  2:  2005 


Conferences: Siggraph, Graphics Hardware,

• Course notes: Siggraph ’04, IEEE Visualization ’04